1 Introduction

First, we again create 1 complete dataset, which is done by means of imputing the boys dataset in mice (Van Buuren & Groothuis-Oudshoorn, 2011) with default settings, and complete the dataset. Then, based on this one true dataset, we extract the regression coefficients of a linear regression model in which weight is regressed on age and height, that are used to compare the estimates of the synthetic versions of the datasets. The true regression coefficients are equal to \(\beta_0 = -9.216\), \(\beta_{age} = 2.327\), \(\beta_{height} = 0.191\). Then, for a total of 200 iterations, we impute a matrix with the same dimensions as the boys dataset (that is, the sample size \(n = 748\) and the number of predictors \(k = 9\)). This is done once with the default settings, that is, with pmm for the continuous variables, polr for binary variables and polyreg for ordinal variables with more than two categories. Furthermore, bmi is imputed passively, since it consists of a fixed combination of height and weight bmi = (wgt / (hgt/100)^2. Furthermore, the predictor matrix is adjusted so that imputations for bmi do not flow back into predictions of hgt and wgt. Then, for every synthetic dataset, the estimates, and corresponding variance of the estimates are calculated, as well as the \(95\%~CI\) are calculated. Furthermore, an indicator is added, whether or not the confidence interval includes the true parameter estimates. Note that these confidence intervals are still based on imputations for missing data, instead of imputations for synthetic data. This is due to the fact that the calculation for the degrees of freedom for the estimates based imputing synthetic values might yield degrees of freedom that are so small that the corresponding \(95\% ~ CI\) yields boundary values of \(-\infty\) and \(\infty\).


2 Simulations


2.1 Single dataset approach


2.1.1 True results

broom::tidy(truemodel, conf.int=TRUE) %>%
  mutate(CIW = conf.high - conf.low) %>% 
  kable(digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
term estimate std.error statistic p.value conf.low conf.high CIW
(Intercept) -9.216 2.033 -4.534 0 -13.207 -5.226 7.982
age 2.327 0.189 12.335 0 1.956 2.697 0.741
hgt 0.191 0.028 6.843 0 0.136 0.246 0.110

2.1.2 Use mice to synthesize and to pool


2.1.2.1 Pool the results of the synthetic datasets with the default mice settings.

default_fit <- syns_def %>%
  map(function(x) x %$% lm(wgt ~ age + hgt) %>% pool %>% summary) %>%
  map_dfr(function(x) {
    var      <- x[,1]
    true_est <- coefs
    est      <- x[,2]
    true_se  <- sqrt(diag(vcov(truemodel)))
    se       <- x[,3]
    df       <- x[,5]
    lower    <- est + se * qt(.025, df)
    upper    <- est + se * qt(.975, df)
    cov      <- lower < coefs & coefs < upper
  
    bind_cols(var = var, true_est = true_est, est = est, true_se = true_se, 
                               se = se, df = df, lower = lower, upper = upper, cov = cov)
    })

results_def <- default_fit %>%
  group_by(var) %>%
  summarise("True Est" = unique(true_est),
            "Syn Est"  = mean(est),
            "Bias"     = mean(est - true_est),
            "True SE"  = unique(true_se),
            "Syn SE"   = mean(se),
            "df"       = mean(df),
            "Lower"    = mean(lower),
            "Upper"    = mean(upper),
            "CIW"      = mean(upper - lower),
            "Coverage" = mean(cov))

2.1.2.2 Pool the results of the synthetic dataset with CART.

cart_fit <- syns_cart %>%
  map(function(x) x %$% lm(wgt ~ age + hgt) %>% pool %>% summary) %>%
  map_dfr(function(x) {
    var      <- x[,1]
    true_est <- coefs
    est      <- x[,2]
    true_se  <- sqrt(diag(vcov(truemodel)))
    se       <- x[,3]
    df       <- x[,5]
    lower    <- est + se * qt(.025, df)
    upper    <- est + se * qt(.975, df)
    cov      <- lower < coefs & coefs < upper

    bind_cols(var = var, true_est = true_est, est = est, true_se = true_se,
              se = se, df = df, lower = lower, upper = upper, cov = cov)
    })

 results_cart <- cart_fit %>%
   group_by(var) %>%
   summarise("True Est" = unique(true_est),
             "Syn Est"  = mean(est),
             "Bias"     = mean(est - true_est),
             "True SE"  = unique(true_se),
             "Syn SE"   = mean(se),
             "df"       = mean(df),
             "Lower"    = mean(lower),
             "Upper"    = mean(upper),
             "CIW"      = mean(upper - lower),
             "Coverage" = mean(cov))

2.1.2.3 Results for both synthesization methods

bind_rows("Mice default" = results_def,
          "Mice with cart" = results_cart, .id = "Imputation method") %>%
  kable(digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Imputation method var True Est Syn Est Bias True SE Syn SE df Lower Upper CIW Coverage
Mice default (Intercept) -9.216 -2.160 7.057 2.033 2.941 28.084 -8.657 4.338 12.995 0.338
Mice default age 2.327 2.948 0.621 0.189 0.285 18.624 2.304 3.591 1.287 0.490
Mice default hgt 0.191 0.094 -0.097 0.028 0.041 22.553 0.003 0.186 0.183 0.388
Mice with cart (Intercept) -9.216 -7.632 1.585 2.033 2.691 43.924 -13.312 -1.952 11.361 1.000
Mice with cart age 2.327 2.460 0.133 0.189 0.259 37.111 1.904 3.015 1.112 1.000
Mice with cart hgt 0.191 0.170 -0.021 0.028 0.038 37.963 0.089 0.251 0.162 1.000

2.1.3 Use mice to synthesize with pooling rules from Drechsler

For now, we forget about the default mice imputation method, because it leads to flawed results. Therefore, we continue with the CART results only. We fit the lm model in which wgt is regressed on age and hgt.

## pool.lm.syn

## If you have a multiply imputed synthetic dataset, that is, multiple fully imputed 
## datasets that are generated by mice, using an all-ones matrix as the where matrix, 
## this function can be used to analyse the synthetic data by means of a linear 
## regression model, and to pool the results.

pool.syn <- function(mira) {
  
  fitlist <- mira %$% analyses
  
  m <- length(fitlist)
  
  pooled <- fitlist %>% 
    map_dfr(broom::tidy) %>%
    group_by(term) %>%
    summarise(est     = mean(estimate),
              bm      = sum((estimate - est)^2) / (m - 1),
              ubar    = mean(std.error^2),
              var_u   = (1 + 1/m) * bm - ubar,
              var     = if_else(var_u > 0, var_u, ubar),
              df      = max(1, (m - 1) * (1 - ubar / (bm + bm/m))^2),
              lower   = est - qt(.975, df) * sqrt(var),
              upper   = est + qt(.975, df) * sqrt(var), .groups = 'drop')
  pooled
}

ci_cov <- function(pooled, true_fit) {
  
  coefs <- coef(true_fit)
  vars   <- diag(vcov(true_fit))
  
  nsim <- nrow(pooled) / length(coefs)
  
  pooled %>% mutate(true_coef = rep(coefs, nsim),
                    true_var  = rep(vars, nsim),
                    cover     = lower < true_coef & true_coef < upper) %>%
    group_by(term) %>%
    summarise("True Est" = unique(true_coef),
              "Syn Est"  = mean(est),
              "Bias"     = mean(est - true_coef),
              "True SE"  = unique(sqrt(true_var)),
              "Syn SE"   = mean(sqrt(var)),
              "df"       = mean(df),
              "Lower"    = mean(lower),
              "Upper"    = mean(upper),
              "CIW"      = mean(upper - lower),
              "Coverage" = mean(cover))
}
models <- syns_cart %>% map(function(x) x %$% lm(wgt ~ age + hgt))

Then, we use the custom pooling function to pool the estimates.

pooled <- models %>% map_dfr(pool.syn)

And another custom function to summarise all results, and obtain the confidence interval coverage.

ci_cov(pooled, truemodel) %>%
  kable(digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
term True Est Syn Est Bias True SE Syn SE df Lower Upper CIW Coverage
(Intercept) -9.216 -7.632 1.585 2.033 1.770 32.803 -20.650 5.386 26.036 0.996
age 2.327 2.460 0.133 0.189 0.164 67.320 1.154 3.765 2.610 1.000
hgt 0.191 0.170 -0.021 0.028 0.024 44.089 -0.020 0.359 0.379 0.998

It can be seen that the coverage of the confidence interval is nearly equal to one. Since we currently only want to make inferences with regard to the sample, this is okay. Namely, if the CI coverage with regard to the sample would be \(95\%\), the CI coverage with regard to the population would be \(0.9025\%\). Inferences with regard to the population will be discussed in the next paragraph.

However, what is questionable, is the fact that the confidence interval width has increased, even though the variance has decreased. Since the confidence interval width is a function of only the standard error and the degrees of freedom, this must be due to the fact that the degrees of freedom are smaller than the degrees of freedom as returned by the mice pooling function.

mice_df <- data.frame(term = cart_fit$var, df = cart_fit$df)
syn_df <- data.frame(term = pooled$term, df = pooled$df)

df <- bind_rows("Mice" = mice_df, "Synthetic" = syn_df, .id = "Method")

ggplot(data = df, aes(x = df, color = Method, fill = Method)) +
  geom_density(alpha = .7) +
  facet_wrap(~ term, nrow = 1, scales = "free") +
  scale_color_viridis_d(begin = .30, end = .65) +
  scale_fill_viridis_d(begin = .30, end = .65) +
  theme_classic()


2.1.4 Different synthesizing sequence

Now we have established that the mice synthesizing algorithm works as good as the synthpop synthesizing algorithm, we can check whether changing the order in which hgt and wgt are synthesized influences the result of the algorithm.

cart_wgt_hgt %>% 
  map(function(x) x %$% lm(wgt ~ age + hgt)) %>%
  map_dfr(pool.syn) %>%
  ci_cov(truemodel) %>%
  kable(digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
term True Est Syn Est Bias True SE Syn SE df Lower Upper CIW Coverage
(Intercept) -9.216 -7.605 1.612 2.033 1.776 35.370 -20.822 5.612 26.434 1.000
age 2.327 2.463 0.137 0.189 0.165 48.288 1.109 3.818 2.709 0.998
hgt 0.191 0.169 -0.022 0.028 0.024 32.164 -0.023 0.362 0.385 0.998

2.2 Bootstrapped boys data

synthetic_results <- boot_cart %>%
  map(function(x) x %$% lm(wgt ~ age + hgt)) %>%
  map_dfr(pool.syn) %T>% 
  assign("boot_syn", ., envir = globalenv()) %>%
  ci_cov(truemodel) %>%
  kable(digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

norm_results <- boot_cart %>%
  map(function(x) x %$% lm(wgt ~ age + hgt)) %>%
  map_dfr(pool.syn) %>%
  mutate(df = NA,
         lower = est - qnorm(.975) * sqrt(var),
         upper = est + qnorm(.975) * sqrt(var)) %>%
  ci_cov(truemodel) %>%
  kable(digits = 3, caption = "Normal CI approximation") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

true_results <- bootstrap_boys %>%
  map(~ lm(wgt ~ age + hgt, .x)) %>%
  map_dfr(~ broom::tidy(.x, conf.int = TRUE)) %>%
  mutate(true_coef = rep(coef(truemodel), nsim),
         true_se   = rep(sqrt(diag(vcov(truemodel))), nsim),
         cover     = conf.low < true_coef & true_coef < conf.high) %T>% 
  assign("boot_true", ., envir = globalenv()) %>%
  group_by(term) %>%
  summarise("True Est" = unique(true_coef),
            "Boot Est"  = mean(estimate),
            "Bias"     = mean(estimate - true_coef),
            "True SE"  = unique(true_se),
            "Boot SE"   = mean(std.error),
            "DF"       = 745,
            "Lower"    = mean(conf.low),
            "Upper"    = mean(conf.high),
            "CIW"      = mean(conf.high - conf.low),
            "Coverage" = mean(cover)) %>%
  kable(digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
term True Est Boot Est Bias True SE Boot SE DF Lower Upper CIW Coverage
(Intercept) -9.216 -9.301 -0.084 2.033 2.025 745 -13.275 -5.326 7.949 0.934
age 2.327 2.314 -0.013 0.189 0.188 745 1.945 2.683 0.738 0.932
hgt 0.191 0.193 0.001 0.028 0.028 745 0.138 0.247 0.109 0.922
term True Est Syn Est Bias True SE Syn SE df Lower Upper CIW Coverage
(Intercept) -9.216 -8.101 1.116 2.033 1.809 141.821 -19.147 2.946 22.092 0.962
age 2.327 2.419 0.092 0.189 0.164 74.606 1.315 3.523 2.208 0.966
hgt 0.191 0.176 -0.015 0.028 0.024 86.320 0.021 0.331 0.310 0.968
Normal CI approximation
term True Est Syn Est Bias True SE Syn SE df Lower Upper CIW Coverage
(Intercept) -9.216 -8.101 1.116 2.033 1.809 NA -11.646 -4.555 7.090 0.842
age 2.327 2.419 0.092 0.189 0.164 NA 2.098 2.740 0.643 0.824
hgt 0.191 0.176 -0.015 0.028 0.024 NA 0.129 0.223 0.094 0.810

We find that some bias has been introduced due to synthesizing the relationships, so we investigate this in further depth. Additionally, the coverage is somewhat too high, and the average confidence interval width has increased dramatically. Therefore, we will further investigate where this increase stems from.

plot_dat <- bind_rows("Synthetic" = boot_syn %>% 
                                    mutate(ciw = upper - lower) %>%
                                    select(term, est, ciw = ciw), 
                      "Real values" = boot_true %>% 
                                      mutate(ciw = conf.high - conf.low) %>%
                                      select(term, est = estimate, ciw = ciw),
                      .id = "Method")


ggplot(plot_dat, aes(x = est, color = Method, fill = Method)) +
  geom_density(alpha = .7) +
  facet_wrap(~ term, nrow = 1, scales = "free") +
  scale_color_viridis_d(begin = .30, end = .65) +
  scale_fill_viridis_d(begin = .30, end = .65) +
  theme_classic()

It can be seen that the bias is systematic, and thus not due to any extreme values that pull the average estimate to zero. Additionally, it seems that the pooled standard error indicated in the table is somewhat too low for the synthetic data. Namely, it appears in the plot that the spread in the synthetic data is roughly equal to the spread in the bootstrapped, but real, data. However, in the table above, the synthetic standard errors are much smaller than the mean over the manually calculated standard error of the bootstrapped estimates. When the standard errors are calculated over the obtained regression estimates, we find that the synthetic standard error increases much more when we calculate it over the observed synthetic estimates.

boot_se <- boot_true %>% group_by(term) %>% summarise("Bootstrapped SE" = sd(estimate))
boot_syn_se <- boot_syn %>% group_by(term) %>% summarise("Synthetic bootstrapped SE" = sd(est))

bind_cols(boot_se, boot_syn_se[,2]) %>%
  kable(digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)
term Bootstrapped SE Synthetic bootstrapped SE
(Intercept) 2.105 2.046
age 0.207 0.202
hgt 0.030 0.030
ggplot(plot_dat, aes(x = ciw, color = Method, fill = Method)) +
  geom_density(alpha = .7) +
  facet_wrap(~ term, nrow = 1, scales = "free") +
  scale_color_viridis_d(begin = .30, end = .65) +
  scale_fill_viridis_d(begin = .30, end = .65) +
  theme_classic()

When we look at the confidence interval width, we find that the bootstrap CI is of approximately the same length in all iterations, while the CI’s of the synthetic results spread out much more.


2.3 Improving the bias/variance tradeoff

The fact that there is bias introduced, might be due to the fact that there is too little variance in the estimates of the synthetic values. Therefore, an approach that increases the variance of the estimates might yield estimates that display less bias. This can be achieved by decreasing the number of iterations, that is, setting the mice parameter maxit to 1, instead of 5. Additionally, the complexity parameter can be set to \(10^{-8}\) instead of \(10^{-4}\), so that more splits will be made. Also, the value of the rpart parameter minbucket, that is, the minimum number of observations in any terminal node, is set to 3 instead of 5, so that, once again, more splits can be made. By means of these changes, the within variance increases, while the between-imputations variability decreases. The results with 500 iterations on the same, singly imputed boys dataset can be seen below.

syns_cart_maxit_cp_min3 %>%
  map(function(x) x %$% lm(wgt ~ age + hgt)) %>%
  map_dfr(pool.syn) %>%
  ci_cov(., truemodel) %>%
  kable(digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
## `summarise()` ungrouping output (override with `.groups` argument)
term True Est Syn Est Bias True SE Syn SE df Lower Upper CIW Coverage
(Intercept) -9.216 -8.357 0.859 2.033 1.825 63.626 -19.388 2.674 22.062 1
age 2.327 2.406 0.080 0.189 0.167 84.422 1.224 3.589 2.366 1
hgt 0.191 0.179 -0.012 0.028 0.025 64.223 0.012 0.346 0.334 1

It can be seen that the bias indeed decreases, while the standard error of the estimates increases. This is due to the fact that the within-imputation variance increases, while the between-imputation variance decreases somewhat. When we use the same approach to estimate the bootstrapped boys data, we get the following results.

boot_cart_maxit_cp_min3 %>%
  map(function(x) x %$% lm(wgt ~ age + hgt)) %>%
  map_dfr(pool.syn) %>%
  ci_cov(., truemodel) %>%
  kable(digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
## `summarise()` ungrouping output (override with `.groups` argument)
term True Est Syn Est Bias True SE Syn SE df Lower Upper CIW Coverage
(Intercept) -9.216 -9.051 0.165 2.033 1.968 330.899 -16.740 -1.362 15.379 0.962
age 2.327 2.336 0.009 0.189 0.178 213.036 1.490 3.181 1.691 0.960
hgt 0.191 0.189 -0.002 0.028 0.027 236.766 0.066 0.312 0.246 0.962

References

Van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in r. Journal of Statistical Software, 45(3), 1–67. Retrieved from https://www.jstatsoft.org/v45/i03/